-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4522: Correct multiindex metadata with groupby #4523
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #4523 +/- ##
===========================================
- Coverage 86.34% 61.57% -24.77%
===========================================
Files 228 228
Lines 18438 18454 +16
===========================================
- Hits 15920 11363 -4557
- Misses 2518 7091 +4573
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
Fix looks good, but introduces a bug when we have a column that shares the name of the index. In pandas, that fails like so: In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])
In [3]: df.groupby(by=['so', 'b']).count()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 df.groupby(by=['so', 'b']).count()
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
7707 axis = self._get_axis_number(axis)
7709 # https://github.com/python/mypy/issues/7642
7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
7713 obj=self,
7714 keys=by,
7715 axis=axis,
7716 level=level,
7717 as_index=as_index,
7718 sort=sort,
7719 group_keys=group_keys,
7720 squeeze=squeeze, # type: ignore[arg-type]
7721 observed=observed,
7722 dropna=dropna,
7723 )
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
879 if grouper is None:
880 from pandas.core.groupby.grouper import get_grouper
--> 882 grouper, exclusions, obj = get_grouper(
883 obj,
884 keys,
885 axis=axis,
886 level=level,
887 sort=sort,
888 observed=observed,
889 mutated=self.mutated,
890 dropna=self.dropna,
891 )
893 self.obj = obj
894 self.axis = obj._get_axis_number(axis)
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:872, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
870 if gpr in obj:
871 if validate:
--> 872 obj._check_label_or_level_ambiguity(gpr, axis=axis)
873 in_axis, name, gpr = True, gpr, obj[gpr]
874 if gpr.ndim != 1:
875 # non-unique columns; raise here to get the name in the
876 # exception message
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/generic.py:1794, in NDFrame._check_label_or_level_ambiguity(self, key, axis)
1786 label_article, label_type = (
1787 ("a", "column") if axis == 0 else ("an", "index")
1788 )
1790 msg = (
1791 f"'{key}' is both {level_article} {level_type} level and "
1792 f"{label_article} {label_type} label, which is ambiguous."
1793 )
-> 1794 raise ValueError(msg)
ValueError: 'so' is both an index level and a column label, which is ambiguous. while it now succeeds in this pr: In [1]: import modin.pandas as pd
In [2]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:
import ray
ray.init()
UserWarning: On Macs, Ray's performance is known to degrade with object store size greater than 2.0 GiB. Ray by default does not allow setting an object store size greater than that. Modin is overriding that default limit because it would rather have a larger, slower object store than spill to disk more often. To override Modin's behavior, you can initialize Ray yourself.
UserWarning: Distributing <class 'list'> object. This may take some time.
In [3]: df.groupby(by=['so', 'b']).count()
Out[3]:
c
so b
1 2 1 this is probably a simple fix though - we can just check if this is happening in the API level so that lower layers know that if the index and column share the same name, its due to a deferred index propagation. |
Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
@@ -1570,7 +1570,7 @@ def test_agg_exceptions(operation): | |||
}, | |||
], | |||
) | |||
def test_to_pandas_convertion(kwargs): | |||
def test_to_pandas_conversion(kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick fix for a typo I noticed when updating the testing suite!
modin/pandas/dataframe.py
Outdated
@@ -415,7 +415,6 @@ def groupby( | |||
) | |||
else: | |||
squeeze = False | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: why are we removing the empty line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove!
Signed-off-by: Rehan Durrani <[email protected]>
…into bugs/4522 Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
modin/pandas/dataframe.py
Outdated
level_name, index_name = "an index", "a column" | ||
if axis == 1: | ||
level_name, index_name = index_name, level_name | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's possible for multiple k
to be invalid and cause us to raise a ValueError, however currently only one such k
will be raised. Is it worth collecting all invalid k
and raising a ValueError if any k
are invalid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked, and in pandas, only the first invalid k
is raised!
@@ -2032,3 +2032,70 @@ def test_sum_with_level(): | |||
} | |||
modin_df, pandas_df = pd.DataFrame(data), pandas.DataFrame(data) | |||
eval_general(modin_df, pandas_df, lambda df: df.set_index("C").groupby("C").sum()) | |||
|
|||
|
|||
def test_reset_index_groupby(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without context, this test is unclear. Could you add a brief description of what error condition this is testing or link to the gh issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can do!
modin/pandas/test/test_groupby.py
Outdated
), | ||
).add_prefix("col") | ||
pandas_df.index.names = [f"index_{i}" for i in range(len(pandas_df.index.names))] | ||
# Convert every other column to string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: convert every even column to string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change!
for col in pandas_df.iloc[ | ||
:, [i for i in range(len(pandas_df.columns)) if i % 2 == 0] | ||
]: | ||
pandas_df[col] = [str(chr(i)) for i in pandas_df[col]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be helpful if you could describe the schema for pandas_df
here, to make the results of the above computation clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
Signed-off-by: Rehan Durrani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Rehan Durrani <[email protected]>
There was another bug, where if we were passed a pandas.Series or Index to |
Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
modin/pandas/dataframe.py
Outdated
# We don't need to check if `by` is a Series or Index, since those | ||
# won't be referencing labels | ||
if not isinstance(by, (pandas.Series, Series, pandas.Index)): | ||
_by_check = by if is_list_like(by) else [by] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to rename this from _by_check
to _by_list
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that makes sense to me!
Signed-off-by: Rehan Durrani <[email protected]>
We've decided to revert back the fix that makes us handle |
Signed-off-by: Devin Petersohn [email protected]
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
reset_index
followed bygroupby
causes exception in some cases #4522docs/development/architecture.rst
is up-to-date